Nareed Hashem - Shiraz Fero
1.1. The Data
1.2. Research Question
1.3. Our Approach
2.1. Data Cleansing
2.2. Create the Network
3.1. Degree Distributions
3.2. Betweenness Distributions
4.1. Building the new graph
4.2. Finding most central sub-category
5.1. Again, central sub-category
5.2. Strongest edge
We chose the Cora citation network: a directed, unweighted network where nodes represent scientific papers and an edge from one node to another indicates that the first paper cites the second. In addition, the papers are classified into categories and sub-categories.
What is the most cited sub-category that all other categories depend on?
Most central sub-category - the one whose papers are most often cited by papers from other sub-categories.
To answer the research question we'll build a new graph where each node is the collection of articles from one sub-category. This new graph is directed and weighted, where the weight of an edge from sub-category A to sub-category B is the percentage of A's papers that cite B, times the percentage of B's papers that are cited by A.
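As a minimal sketch of this weight formula, with made-up counts (not taken from the Cora data):

```python
# Hypothetical counts for two sub-categories A and B (illustration only)
n_a = 40          # articles in sub-category A
n_b = 25          # articles in sub-category B
a_citing_b = 10   # articles in A that cite at least one article in B
b_cited_by_a = 5  # articles in B cited by at least one article in A

# weight of the edge A -> B: how strongly A depends on B
weight = (a_citing_b / n_a) * (b_cited_by_a / n_b)
print(weight)  # 0.25 * 0.2 = 0.05
```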
# Import packages
import numpy as np
import os
import pandas as pd
import networkx as nx
import time
from random import sample
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import plotly
import plotly.graph_objects as go
import plotly.express as px
# edges.csv has 91500 rows, nodes.csv has 23166 rows
edges = pd.read_csv('./data/edges.csv', delimiter=' ')
nodes = pd.read_csv('./data/nodes.csv', delimiter=' ')
nodes = nodes[['network_id', 'node_id']]
# get IDs, get each node's category, and combine them in the names df
names = pd.read_csv('./data/ent.subelj_cora_cora.id.csv', delimiter=' ')
cat = pd.read_csv('./data/ent.subelj_cora_cora.class.csv', delimiter=' ')
names['category'] = cat['category']
result = pd.merge(nodes, names, on='node_id')
result = result.drop(['node_id'], axis=1)
# split category and sub-category (e.g. '/Cat/Sub_cat' -> 'Cat', 'Sub cat')
cat_df = result.copy()
split = cat_df['category'].astype(str).str.split('/')
cat_df['category'] = split.str[1].str.replace('_', ' ')
cat_df['sub_cat'] = split.str[2].str.replace('_', ' ')
The final data consists of two dataframes: one for the edges, and another with each node, its category, and its sub-category. An example:
edges.head(2)
cat_df.head(2)
From our data, we create a directed graph using the networkx library:
edges_tuple = [tuple(x) for x in edges.to_numpy()]
DG = nx.DiGraph()
DG.add_edges_from(edges_tuple)
Some of the network's properties:
print(nx.info(DG))
print("Is the Graph directed? " + str(DG.is_directed()))
print("Graph Density is: " + str(nx.density(DG)))
print("Average Clustering: " + str(nx.average_clustering(DG, nodes=None, weight=None, count_zeros=True)))
def dictionary_to_df(deg_view, measure):
    """Given the iterable of (key, value) pairs we get from networkx
    (e.g. a degree view), build a frequency DataFrame with two columns:
    the measure's value and how many items have it."""
    values = [val for key, val in list(deg_view)]
    unique = list(set(values))
    count = [values.count(v) for v in unique]
    dfout = pd.DataFrame(list(zip(unique, count)),
                         columns=[measure, 'Freq'])
    return dfout
import plotly.io as pio
pio.renderers.default = "notebook"
%matplotlib inline
out_deg = DG.out_degree()
in_deg = DG.in_degree()
out_df = dictionary_to_df(out_deg, 'Degree')
in_df = dictionary_to_df(in_deg, 'Degree')
# Plot the in and out degree distribution
fig = plotly.subplots.make_subplots(rows=1, cols=2, horizontal_spacing=0.1,
subplot_titles=("In-Degree Distribution","Out-Degree Distribution"),
specs=[[{"type": "xy"},{"type": "xy"}]])
fig.add_trace(
go.Scatter(x=in_df['Degree'], y=in_df['Freq'],marker_symbol='hexagon2', mode="markers+text",
marker=dict(size=12,color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))), row=1, col=1)
fig.add_trace(
go.Scatter(x=out_df['Degree'], y=out_df['Freq'],marker_symbol='hexagon2', mode="markers+text",
marker=dict(size=12,color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))), row=1, col=2)
fig.update_xaxes(title_text="Degree")
fig.update_yaxes(title_text="Frequency")
fig.update_layout(height=500, width=1000,showlegend=False)
fig.show()
As expected, both the in- and out-degree distributions follow a power law, meaning small degrees are extremely common. Notice that the frequency of in-degree 0 is much higher than that of out-degree 0, which makes sense: most articles cite at least one other article, but a large number of articles have not been cited by anyone so far.
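One quick way to sanity-check the power-law shape is to fit a line in log-log space; its slope approximates the exponent. A sketch (the helper `loglog_slope` is ours, and the commented call assumes the `in_df` table built above):

```python
import numpy as np

def loglog_slope(degree, freq):
    """Estimate the power-law exponent as the slope of a linear fit
    in log-log space, ignoring zero degrees/frequencies."""
    deg = np.asarray(degree, dtype=float)
    frq = np.asarray(freq, dtype=float)
    mask = (deg > 0) & (frq > 0)
    slope, _ = np.polyfit(np.log10(deg[mask]), np.log10(frq[mask]), 1)
    return slope

# e.g. loglog_slope(in_df['Degree'], in_df['Freq'])
```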
bet = nx.edge_betweenness_centrality(DG)
betw_df = dictionary_to_df(bet.items(), 'Betweenness')
betw_df['Freq'] = np.log10(betw_df['Freq'])
fig = px.scatter(betw_df, x="Betweenness", y="Freq")
fig.update_yaxes(title_text="Log Frequency")
fig.update_layout(height=500, width=700,showlegend=False, title='Betweenness Distribution')
fig.show()
# get the nodes connected by the edge with the highest betweenness:
bet_df = pd.DataFrame(list(bet.items()), columns=['edge', 'betweenness'])
bet_df = bet_df.sort_values(by=['betweenness'], ascending=False)
bet_df.head(2)
The edge with the highest betweenness goes from SOURCE to TARGET.
SOURCE
Sub-category: Compression (Encryption and Compression)
TARGET
Sub-category: Memory Management (Operating Systems)
An edge with a high edge betweenness centrality score acts as a bridge between two parts of the network; removing it may disrupt communication along the shortest paths between many pairs of nodes.
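To illustrate on a toy graph (a sketch, not the Cora data): two triangles joined by a single bridge edge. The bridge carries every cross-triangle shortest path, so it gets the highest edge betweenness.

```python
import networkx as nx

# two triangles {0,1,2} and {3,4,5} connected by one bridge edge (2, 3)
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
bet = nx.edge_betweenness_centrality(G)

# the edge with the highest score is the bridge between the triangles
bridge = max(bet, key=bet.get)
print(bridge)
```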
We're aiming to find the sub-category that is cited the most. As we said before, we'll build a new graph where:
Nodes: Sub-categories.
Edges: citations between sub-categories.
Edge Weight: Strength in which one sub-category depends on the other.
# Build a df with each sub-category, its id, its category and the number of articles in it
subcat_df = (cat_df.groupby('sub_cat', sort=False)
                   .agg(count=('sub_cat', 'size'), category=('category', 'first'))
                   .reset_index()
                   .rename(columns={'sub_cat': 'subcat'}))
subcat_df.insert(0, 'subcat_id', range(1, len(subcat_df) + 1))
subcat_df = subcat_df[['subcat_id', 'count', 'subcat', 'category']]
subcat_df.head(3)
Here we have all 62 sub-categories, each with an id, its category, and a count of the articles in it.
# Here we're adding cited and cites sub-category id to the edges df
new_edges = pd.merge(edges, cat_df.drop(columns='category'),left_on='cites', right_on='network_id')
new_edges = pd.merge(new_edges.drop(columns = 'network_id'), subcat_df.drop(columns=['category','count']),left_on='sub_cat', right_on='subcat')
new_edges = new_edges.rename(columns={"subcat_id": "cites_subcat"})
new_edges = pd.merge(new_edges.drop(columns=['sub_cat', 'subcat']), cat_df.drop(columns='category'),left_on='cited', right_on='network_id')
new_edges = pd.merge(new_edges.drop(columns = 'network_id'), subcat_df.drop(columns=['category','count']),left_on='sub_cat', right_on='subcat')
new_edges = new_edges.rename(columns={"subcat_id": "cited_subcat"})
new_edges = new_edges.drop(columns=['sub_cat', 'subcat'])
# Remove rows where cited and cites are in the same sub-category
new_edges = new_edges.loc[new_edges['cites_subcat'] != new_edges['cited_subcat']]
new_edges.head(3)
For each edge we'll calculate two percentages. Given sub-category A citing sub-category B, we first calculate the percentage of articles in A that cite articles in B, then the percentage of articles in B that are cited by A. The product of these two percentages is the edge's weight; it indicates how much A depends on B.
# The first calculation - number of articles from a sub-category that cite articles from another sub-category.
unique_subcat = new_edges.cites_subcat.unique()
rows_list = []
for source in unique_subcat:
    # all edges whose citing article is in the source sub-category
    bycites = new_edges.loc[new_edges['cites_subcat'] == source]
    for target in unique_subcat:
        bycited = bycites.loc[bycites['cited_subcat'] == target]
        # count each citing article once
        bycited = bycited.drop_duplicates(subset=['cites'])
        rows_list.append([source, target, bycited.shape[0]])
out_df = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
out_df.head(3)
# The second calculation - number of articles from a sub-category that are cited by articles from another sub-category.
rows_list = []
for target in unique_subcat:
    # all edges whose cited article is in the target sub-category
    bycited = new_edges.loc[new_edges['cited_subcat'] == target]
    for source in unique_subcat:
        bycites = bycited.loc[bycited['cites_subcat'] == source]
        # count each cited article once
        bycites = bycites.drop_duplicates(subset=['cited'])
        rows_list.append([source, target, bycites.shape[0]])
in_df = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
in_df.head(3)
# turn the numbers into percentages, calculate their product and create the final edges df with weights.
indf = in_df.copy()
outdf = out_df.copy()
indf = pd.merge(subcat_df, indf ,left_on='subcat_id', right_on='target')
indf['percent_in'] = (indf['number']/indf['count'])
indf = indf.drop(columns=['count', 'subcat_id', 'subcat', 'category', 'number'])
indf = indf.loc[~(indf['percent_in']==0)]
outdf = pd.merge(subcat_df, outdf ,left_on='subcat_id', right_on='source')
outdf['percent_out'] = (outdf['number']/outdf['count'])
outdf = outdf.drop(columns=['count', 'number', 'subcat', 'category', 'subcat_id'])
outdf = outdf.loc[~(outdf['percent_out']==0)]
merged = pd.merge(indf, outdf, left_on=['source','target'], right_on = ['source','target'])
merged['weight'] = merged['percent_in']*merged['percent_out']
merged = merged.drop(columns=['percent_in', 'percent_out'])
merged = merged.sort_values(by=['weight'], ascending=False)
merged.head(3)
G = nx.from_pandas_edgelist(merged, source='source', target='target',
                            edge_attr='weight', create_using=nx.DiGraph())
print("Basic information about the new network:")
print(nx.info(G))
print("Is the Graph directed? " + str(G.is_directed()))
print("Graph Density is: " + str(nx.density(G)))
print("Average Clustering: " + str(nx.average_clustering(G, nodes=None, weight=None, count_zeros=True)))
Calculate the weighted in-degree of each node (sub-category), i.e. the sum of the weights on its incoming edges.
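For instance, on a toy weighted digraph (hypothetical nodes and weights, not the real sub-categories), the weighted in-degree is just this sum:

```python
import networkx as nx

T = nx.DiGraph()
T.add_edge('A', 'C', weight=0.3)
T.add_edge('B', 'C', weight=0.5)
T.add_edge('C', 'A', weight=0.1)

# weighted in-degree of C = 0.3 + 0.5 = 0.8 (the outgoing 0.1 is ignored)
print(T.in_degree('C', weight='weight'))
```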
subcat_indegree = pd.DataFrame(G.in_degree(weight='weight'), columns=['subcat_id', 'in_degree'])
subcat_indegree = subcat_indegree.sort_values(by=['in_degree'], ascending=False)
subcat_indegree = pd.merge(subcat_indegree, subcat_df, on='subcat_id')
subcat_indegree.head(2)
We'll also calculate the sub-category with the highest weighted out-degree, which is the one that relies most on the others.
subcat_outdegree = pd.DataFrame(G.out_degree(weight='weight'), columns=['subcat_id', 'out_degree'])
subcat_outdegree = subcat_outdegree.sort_values(by=['out_degree'], ascending=False)
subcat_outdegree = pd.merge(subcat_outdegree, subcat_df, on='subcat_id')
subcat_outdegree.head(2)
The result is Memory Management, also from Operating Systems.
We've found the most central sub-category in the whole network (all 62 sub-categories); now we want to check whether that sub-category is also the most central one within its own category. In other words, if we run the same analysis as before, but only on the sub-categories of Operating Systems, will we get the same answer?
#Get all sub-categories that belong to Operating Systems
os_nodes = subcat_df.loc[subcat_df['category'] == 'Operating Systems']
#Get all edges that are within OS
os_edges = new_edges.loc[new_edges['cites_subcat'].isin(os_nodes['subcat_id']) & new_edges['cited_subcat'].isin(os_nodes['subcat_id'])]
# The first calculation - number of articles from a sub-category that cite articles from another sub-category.
unique_subcat = os_edges.cites_subcat.unique()
rows_list = []
for source in unique_subcat:
    # all OS edges whose citing article is in the source sub-category
    bycites = os_edges.loc[os_edges['cites_subcat'] == source]
    for target in unique_subcat:
        bycited = bycites.loc[bycites['cited_subcat'] == target]
        bycited = bycited.drop_duplicates(subset=['cites'])
        rows_list.append([source, target, bycited.shape[0]])
os_out = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
# the out-percentage denominator is the citing (source) sub-category's size
os_out = pd.merge(os_nodes, os_out, left_on='subcat_id', right_on='source')
os_out['percent_out'] = os_out['number'] / os_out['count']
os_out = os_out.drop(columns=['count', 'subcat_id', 'subcat', 'category', 'number'])
os_out = os_out.loc[~(os_out['percent_out'] == 0)]
# The second calculation - number of articles from a sub-category that are cited by articles from another sub-category.
rows_list = []
for target in unique_subcat:
    # all OS edges whose cited article is in the target sub-category
    bycited = os_edges.loc[os_edges['cited_subcat'] == target]
    for source in unique_subcat:
        bycites = bycited.loc[bycited['cites_subcat'] == source]
        bycites = bycites.drop_duplicates(subset=['cited'])
        rows_list.append([source, target, bycites.shape[0]])
os_in = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
# the in-percentage denominator is the cited (target) sub-category's size
os_in = pd.merge(os_nodes, os_in, left_on='subcat_id', right_on='target')
os_in['percent_in'] = os_in['number'] / os_in['count']
os_in = os_in.drop(columns=['count', 'number', 'subcat', 'category', 'subcat_id'])
os_in = os_in.loc[~(os_in['percent_in'] == 0)]
merged_os = pd.merge(os_out, os_in, left_on=['source','target'], right_on = ['source','target'])
merged_os['weight'] = merged_os['percent_in']*merged_os['percent_out']
merged_os = merged_os.drop(columns=['percent_in', 'percent_out'])
merged_os = merged_os.sort_values(by=['weight'], ascending=False)
merged_os.head(3)
Build the graph and calculate the in-degree for each node:
OS_G = nx.DiGraph()
for index, row in merged_os.iterrows():
OS_G.add_edge(row['source'], row['target'], weight=row['weight'])
os_indegree = pd.DataFrame(OS_G.in_degree(weight='weight'), columns=['subcat_id', 'in_degree'])
os_indegree = os_indegree.sort_values(by=['in_degree'], ascending=False)
os_indegree = pd.merge(os_indegree, os_nodes, on='subcat_id')
os_indegree
As we expected, the most central sub-category among the Operating Systems sub-categories is the same one that is most central in the whole network: Distributed.
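Since each edge weight depends only on its own pair of sub-categories, the within-category ranking could equivalently be read off the induced subgraph of `G` restricted to the Operating Systems node ids. A minimal sketch on a toy graph (hypothetical node ids and weights, not the real data):

```python
import networkx as nx

# toy weighted digraph standing in for the sub-category graph G
G = nx.DiGraph()
G.add_edge(1, 2, weight=0.4)
G.add_edge(3, 2, weight=0.2)
G.add_edge(2, 1, weight=0.1)

# induced subgraph on a subset of node ids (pretend 1 and 2 belong to
# Operating Systems); copy() detaches the read-only subgraph view
OS_sub = G.subgraph([1, 2]).copy()

# edge weights inside the subset are preserved exactly, so the weighted
# in-degrees match what a within-subset recomputation would give
print(dict(OS_sub.in_degree(weight='weight')))
```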
Finally, we check which sub-category relies most on another sub-category, i.e. which edge has the highest weight.
temp = merged.copy()
temp = pd.merge(temp, subcat_df.drop(columns='count'), left_on='source', right_on='subcat_id')
temp = temp.rename(columns={"subcat": "source_name", "category":"source_cat"})
temp = pd.merge(temp.drop(columns='subcat_id'), subcat_df.drop(columns='count'), left_on='target', right_on='subcat_id')
temp = temp.rename(columns={"subcat": "target_name", "category":"target_cat"})
temp = temp.sort_values(by=['weight'], ascending=False)
temp = temp.drop(columns = 'subcat_id')
temp.head(2)
The edge with the highest weight goes from Filtering (IR) to Retrieval (IR), which means the first relies on the second more strongly than any other sub-category pair in the graph.